Abstract: Denoising diffusion models are designed with a simple forward process, yet they pose challenges for efficient sampling. Instead of striving for an accelerated sampler, we propose new bilateral denoising diffusion models (BDDMs) that parameterize the forward and reverse processes with a score network and a scheduling network, respectively. From a bilateral modeling objective, we derive a tighter lower bound on the likelihood, used as a surrogate objective, to achieve high-quality and fast generation compared to other cutting-edge samplers. In particular, with negligible training overhead, the proposed BDDMs generate significantly higher-quality samples with a 62x inference speedup relative to denoising diffusion probabilistic models (DDPMs).
Fast and high-fidelity speech generation using a 7-step noise schedule estimated by BDDMs:
Note: Consider reducing the volume for the first few iterations below as they are mostly white noise.
Text: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.
| Reverse step | LS-MSE | MCD | STOI | PESQ |
| 1 | 3393 | 6.55 | 0.405 | 0.641 |
| 2 | 2706 | 6.19 | 0.493 | 0.681 |
| 3 | 1805 | 5.58 | 0.646 | 1.06 |
| 4 | 1080 | 4.86 | 0.813 | 1.58 |
| 5 | 584 | 4.01 | 0.928 | 2.23 |
| 6 | 284 | 3.10 | 0.973 | 2.87 |
| 7 | 77.2 | 1.94 | 0.984 | 4.02 |
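The step-by-step progression from white noise to clean speech can be illustrated with a minimal sketch of DDPM-style ancestral sampling over a short noise schedule. This is not the BDDM implementation: `reverse_sample`, `eps_model`, and the beta values are hypothetical placeholders, the network is assumed to be conditioned on the noise level (the square root of the cumulative alpha product), and the schedule shown is illustrative rather than the one estimated by the scheduling network.

```python
import numpy as np


def reverse_sample(eps_model, betas, shape, rng):
    """Run DDPM-style reverse steps and return every intermediate sample.

    eps_model(x, noise_level) is a hypothetical stand-in for a trained
    score (noise-prediction) network; betas is the N-step noise schedule.
    """
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # cumulative signal-retention factors

    x = rng.standard_normal(shape)  # the chain starts from white noise
    trajectory = [x]
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = eps_model(x, np.sqrt(alpha_bars[t]))
        # Posterior mean: remove the predicted noise, then rescale.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Re-inject a smaller amount of noise on all but the last step.
            var = (1.0 - alpha_bars[t - 1]) / (1.0 - alpha_bars[t]) * betas[t]
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean
        trajectory.append(x)
    return trajectory


# Illustrative 7-step schedule (assumed values, NOT the BDDM-estimated one).
example_betas = np.geomspace(1e-4, 0.5, num=7)

# Toy "network": exact noise predictor when the clean signal is all zeros.
def toy_eps_model(x, noise_level):
    return x / np.sqrt(1.0 - noise_level**2)

samples = reverse_sample(toy_eps_model, example_betas, (16,), np.random.default_rng(0))
```

Here `samples[0]` is the initial noise and `samples[-1]` is the output of the final reverse step, so the list has one more entry than the schedule has steps. With a real score network, the early entries would still be dominated by noise, which is consistent with the volume note above.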
Comparing Convergence Rates of Different Noise Schedules
LJSpeech samples from different generative diffusion models:
Note: Different rows correspond to different noise schedules or sampling methods for inference.
| Text | and having, quote, somewhat bushy, end quote, hair. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President. |
| Ground Truth | |||
| DDPM - 8 steps (Grid Search) | |||
| DDPM - 1000 steps (Linear) | |||
| DDIM - 8 steps (Linear) | |||
| DDIM - 100 steps (Linear) | |||
| NE - 8 steps (Linear) | |||
| BDDM (ours) - 8 steps | |||
VCTK samples from different generative diffusion models:
Note: Different rows correspond to different noise schedules or sampling methods for inference.
| Text | Frankly, we should all have such problems. | I felt he was excellent. | Frankly, we should all have such problems. |
| Ground Truth | |||
| DDPM - 8 steps (Grid Search) | |||
| DDPM - 1000 steps (Linear) | |||
| DDIM - 8 steps (Linear) | |||
| DDIM - 100 steps (Linear) | |||
| NE - 8 steps (Linear) | |||
| BDDM (ours) - 8 steps | |||
CIFAR-10 samples generated from BDDM:
| BDDM - 10 steps | BDDM - 20 steps | BDDM - 100 steps |